1 Introduction

Bioconductor provides download statistics for the project. If you are curious about how the downloads of a certain package have evolved, or how downloads progress over time, this is the right place.

2 Load data

First we read the latest data from the Bioconductor project. There are two files: one with the download stats from 2009 until today and another with the download stats of the software packages. We will only use the first one:

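library("data.table") # as.data.table() and the [i, j, by] syntax
library("zoo")        # as.yearmon()
library("ggplot2")    # plots
web <- "https://www.bioconductor.org/packages/stats/bioc/"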
stats <- read.delim(paste0(web, "bioc_pkg_stats.tab"), 
                    colClasses = c("factor", "character", "character", "numeric", "numeric"))
stats <- as.data.table(stats)
stats <- stats[Month != "all", , ]
# Convert English month abbreviations ("Jan", ..., "Dec") to two-digit strings.
# month.abb is a base R constant, so this does not depend on the locale.
monthsConvert <- function(x) {
  sprintf("%02d", match(x, month.abb))
}
stats$Month <- monthsConvert(stats$Month)
stats$Date <- as.POSIXct(as.yearmon(paste(stats$Year, stats$Month, sep = "-")), frac = 1)
head(stats)
##    Package Year Month Nb_of_distinct_IPs Nb_of_downloads                Date
## 1: ABarray 2017    01                105             153 2017-01-01 01:00:00
## 2: ABarray 2017    02                155             229 2017-02-01 01:00:00
## 3: ABarray 2017    03                119             216 2017-03-01 01:00:00
## 4: ABarray 2017    04                184             272 2017-04-01 02:00:00
## 5: ABarray 2017    05                192             282 2017-05-01 02:00:00
## 6: ABarray 2017    06                  8               8 2017-06-01 02:00:00
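
The conversion from year and month abbreviation to a date can be checked on a small example. This is a minimal sketch with made-up rows, and it assumes base R only (`as.Date()` in place of `as.yearmon()` from zoo), so it is not the exact pipeline used above:

```r
# Toy rows in the same shape as the Bioconductor table (values are made up).
toy <- data.frame(
  Year  = c("2017", "2017", "2016"),
  Month = c("Jan", "Jun", "Dec"),
  stringsAsFactors = FALSE
)
# month.abb holds the English abbreviations regardless of the locale.
toy$MonthNum <- sprintf("%02d", match(toy$Month, month.abb))
# Use the first day of each month as the date of the observation.
toy$Date <- as.Date(paste(toy$Year, toy$MonthNum, "01", sep = "-"))
print(toy)
```

Working with `Date` instead of `POSIXct` avoids the time zone shift visible in the table above (01:00:00 and 02:00:00), at the cost of using `scale_x_date()` instead of `scale_x_datetime()` in the plots.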

There have been 1795 packages in Bioconductor. Some were added recently and some earlier.

3 Packages

3.1 Number

First we explore the number of packages being downloaded by month:

stats2 <- stats[Nb_of_downloads != 0, ] # Keep only the rows of packages with at least one download in that month.
ggplot(stats2[, .(Number = .N), by = Date], aes(Date, Number)) +
  geom_bar(stat = "identity") + 
  theme_bw() +
  ggtitle("Packages downloaded") +
  scale_x_datetime(date_breaks = "3 months") +
  theme(axis.text.x=element_text(angle=60, hjust=1)) + 
  guides(col=FALSE)

Figure 1: Packages in Bioconductor with downloads

The number of packages being downloaded increases almost exponentially with time, which is partially explained by the incorporation of new packages.
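
One way to check whether growth is close to exponential is to fit a straight line to the logarithm of the counts. The sketch below uses synthetic counts (made-up numbers growing 3% per month), not the Bioconductor data; with the real data the monthly counts would come from `stats2[, .(Number = .N), by = Date]`:

```r
# Synthetic monthly package counts growing by 3% per month (illustration only).
months <- 0:59
packages <- 300 * 1.03^months
# Exponential growth is a straight line on the log scale.
fit <- lm(log(packages) ~ months)
# The slope is the log of the monthly growth factor.
growth <- exp(coef(fit)[["months"]]) - 1
round(growth, 3)
```

If the points deviate systematically from the fitted line on the log scale, the growth is not exponential, however steep the bar plot looks.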

ggplot(stats2[, .(Number = sum(Nb_of_downloads)), by = Date], aes(Date, Number)) +
  geom_bar(stat = "identity") + 
  theme_bw() +
  ggtitle("Downloads") +
  scale_x_datetime(date_breaks = "3 months") +
  theme(axis.text.x=element_text(angle=60, hjust=1)) + 
  guides(col=FALSE)

Figure 2: Downloads of packages

Even though the number of packages increases almost exponentially, the number of downloads has grown linearly with time since 2011. This indicates that each software package must compete with more and more packages to be downloaded.

pd <- position_dodge(0.1)
ggplot(stats2[, .(Number = mean(Nb_of_downloads), 
                  ymin = mean(Nb_of_downloads)-1.96*sd(Nb_of_downloads)/sqrt(.N),
                  ymax = mean(Nb_of_downloads)+1.96*sd(Nb_of_downloads)/sqrt(.N)), 
              by = Date], aes(Date, Number)) +
  geom_errorbar(aes(ymin = ymin, ymax = ymax), width=.1, position=pd) +
  geom_bar(stat = "identity") + 
  theme_bw() +
  ggtitle("Downloads") +
  ylab("Mean download for a package") +
  scale_x_datetime(date_breaks = "3 months") +
  theme(axis.text.x=element_text(angle=60, hjust=1)) + 
  guides(col=FALSE)

Figure 3: Downloads of packages per package. The error bar indicates the 95% confidence interval.

Here we can appreciate that the number of downloads per package hasn’t changed much with time. If anything, there is now more dispersion between the downloads of different packages.
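
The error bars above use the normal approximation, mean ± 1.96 standard errors. As a small sketch of that calculation (the helper name `ci95` and the numbers are made up for illustration):

```r
# Normal-approximation 95% confidence interval for a mean.
ci95 <- function(x) {
  m <- mean(x)
  se <- sd(x) / sqrt(length(x))  # standard error of the mean
  c(mean = m, lower = m - 1.96 * se, upper = m + 1.96 * se)
}
# Made-up monthly downloads of five packages.
ci95(c(120, 80, 4500, 300, 95))
```

Download counts are very skewed (a few packages have far more downloads than the rest), so this interval is only a rough guide and its lower bound can even fall below zero.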

3.2 Incorporations

This might be due to an increase in the usage of packages or to new packages bringing more users. We start by finding out how many packages have been introduced in Bioconductor each month.

today <- Sys.Date()
year <- format(today, "%Y")  # Current year as a string, e.g. "2017"
month <- format(today, "%m") # Current month as two digits, e.g. "06"
incorporation <- stats2[ , .SD[which.min(Date)], by = Package, .SDcols = "Date"]
histincorporation <- incorporation[, .(Number = .N), by = Date, ]
ggplot(histincorporation, aes(Date, Number)) + 
  geom_bar(stat="identity") + 
  theme_bw() + 
  ggtitle("Packages with first download") +
  scale_x_datetime(date_breaks = "3 months") +
  theme(axis.text.x=element_text(angle=60, hjust=1))

Figure 4: New packages

We can see that there were more than 350 packages in Bioconductor before 2009, and since then there are occasional rises to about 50 first downloads in a month (which would be new packages being added).
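
The idea of taking the first month with downloads as the incorporation date can be illustrated without data.table. This is a sketch on a made-up table, assuming base R only; the column names mirror the ones used above but the values are invented:

```r
# Made-up download table: package "a" starts in 2009, package "b" in 2010.
toy <- data.frame(
  Package = c("a", "a", "a", "b", "b"),
  Date = as.Date(c("2009-01-01", "2009-02-01", "2009-03-01",
                   "2010-05-01", "2010-06-01")),
  Downloads = c(10, 0, 5, 3, 7),
  stringsAsFactors = FALSE
)
# Drop months without downloads, then keep the earliest row of each package.
toy <- toy[toy$Downloads != 0, ]
toy <- toy[order(toy$Package, toy$Date), ]
first <- toy[!duplicated(toy$Package), c("Package", "Date")]
# Number of packages first seen in each month.
table(format(first$Date, "%Y-%m"))
```

Using `duplicated(toy$Package, fromLast = TRUE)` instead would keep the last month with downloads, which is the quantity needed to approximate removals.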

3.3 Removed

Using a similar procedure we can approximate the packages deprecated and removed each month. In this case we look for the last date a package was downloaded, excluding the current month:

deprecation <- stats2[, .SD[which.max(Date)], by = Package, .SDcols = c("Date",  "Year", "Month")]
deprecation <- deprecation[!(Year == year & Month == month), ] # Exclude only the current month of the current year
histDeprecation <- deprecation[, .(Number = .N), by = Date, ]
ggplot(histDeprecation, aes(Date, Number)) + 
  geom_bar(stat = "identity") + 
  theme_bw() + 
  ggtitle("Packages without downloads") +
  scale_x_datetime(date_breaks = "3 months") +
  theme(axis.text.x=element_text(angle=60, hjust=1))

Figure 5: Approximation of deprecated packages per month

Here we can see the packages whose last download was in a certain month, assuming that this means they are deprecated. It can happen that a package is no longer downloaded but is still in the Bioconductor repository; this would explain the spike to 80 packages in the last month. In total there are `r nrow(incorporation) - nrow(deprecation)` packages still being downloaded. We further explore how much time passes between the incorporation of a package and its last download.

df <- merge(incorporation, deprecation, by = "Package")
timeBioconductor <- as.numeric(difftime(df$Date.y, df$Date.x, units = "days")) / 365 # Transform to years, with explicit units
hist(timeBioconductor, main = "Time in Bioconductor", xlab = "Years")
abline(v = mean(timeBioconductor), col = "red")
abline(v = median(timeBioconductor), col = "green")

Figure: Time of packages between first and last download

We can see that most deprecated packages last less than a year (I would say around two releases), and some stay in Bioconductor up to 6 years before deprecation. But how do the packages that are not deprecated do in Bioconductor?
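
Subtracting two date-times in R returns a `difftime` whose units are chosen automatically, so it is safer to request the units explicitly before converting to years. A minimal sketch with made-up first and last download dates:

```r
# Made-up first and last download dates of two packages.
first <- as.POSIXct(c("2009-01-01", "2012-06-01"), tz = "UTC")
last  <- as.POSIXct(c("2009-07-01", "2017-06-01"), tz = "UTC")
# Ask for days explicitly instead of relying on the automatic units.
days <- as.numeric(difftime(last, first, units = "days"))
years <- days / 365
round(years, 2)
```

With automatic units the same subtraction could come back in seconds, hours or days depending on the data, and a fixed divisor would then give the wrong scale.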

4 Packages downloads

4.1 Ratio downloads per IP

We can start by comparing the number of downloads (different from 0) with how many IPs download each package.

ggplot(stats2, aes(Nb_of_distinct_IPs, Nb_of_downloads, col = Package)) + 
  geom_point() + 
  theme_bw() + 
  geom_smooth(method = "lm") + 
  xlab("Number of distinct IPs") + 
  ylab("Number of downloads") + 
  ggtitle("Downloads by different IP") +
  geom_abline(slope = 2) + 
  guides(col=FALSE)

Figure 6: Downloads and distinct IPs of all months and packages

Not surprisingly, most packages have two downloads from the same IP, one for each Bioconductor release (black line). However, there are some packages where a few IPs download the same package many times, which may indicate that these packages are mostly installed in a few locations.

ratio <- stats2[, .(slope = coef(lm(Nb_of_downloads~Nb_of_distinct_IPs))[2]), by = Package]
ratio <- ratio[order(slope, decreasing = TRUE),]
ratio[slope >= 2, ] 
##               Package     slope
##   1:          mosaics 78.575352
##   2:         flowCore 48.032270
##   3:             xcms 25.645984
##   4:              mzR 23.597604
##   5:            minfi 15.869564
##  ---                           
## 103:            limma  2.021857
## 104:            crlmm  2.011828
## 105:     widgetInvoke  2.008765
## 106: pathprintGEOData  2.000000
## 107:     TCGAWorkflow  2.000000

We can see that the package with the most downloads from the same IP is mosaics, followed by flowCore, xcms and, in fourth place, mzR. The first is for ChIP-seq, the second for flow cytometry, and the third and fourth are for chromatographically separated and single-spectra mass spectral data; maybe only a few locations use these packages.
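
The slope reported for each package is the coefficient of a linear regression of downloads on distinct IPs. The following sketch reproduces that idea on made-up numbers with base R (`split()` and `sapply()` in place of the data.table grouping used above):

```r
# Made-up monthly values for two packages.
toy <- data.frame(
  Package = rep(c("a", "b"), each = 4),
  IPs = c(10, 20, 30, 40, 10, 20, 30, 40),
  Downloads = c(20, 40, 60, 80, 15, 25, 35, 45),
  stringsAsFactors = FALSE
)
# Fit one regression per package and keep the slope for the number of IPs.
slopes <- sapply(split(toy, toy$Package), function(d) {
  coef(lm(Downloads ~ IPs, data = d))[["IPs"]]
})
slopes
```

Because the model includes an intercept, the slope is the number of extra downloads per extra IP, which is not necessarily the same as the plain ratio of downloads to IPs: package "b" has a slope of 1 even though its ratio is above 1 in every month.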

I am curious about how the default packages of Bioconductor are downloaded. Let’s see where they are:

bioc_packages <- c("BiocInstaller", "Biobase", "BiocGenerics", "S4Vectors", "IRanges", "AnnotationDbi")
ratio[Package %in% bioc_packages, ]
##          Package    slope
## 1: BiocInstaller 2.365213
## 2: AnnotationDbi 1.927536
## 3:       IRanges 1.897957
## 4:     S4Vectors 1.792811
## 5:       Biobase 1.636272
## 6:  BiocGenerics 1.629061

Only BiocInstaller is downloaded more than twice per IP.

Now we explore whether there are seasonal cycles in the downloads, as there seem to be some cycles in the figures above.

4.2 By date

First we can explore the number of IPs per month downloading each package:

ggplot(stats2, aes(Date, Nb_of_distinct_IPs, col = Package)) + 
  geom_line() + 
  theme_bw() +
  ggtitle("IPs") +
  ylab("Distinct IP downloads") +
  scale_x_datetime(date_breaks = "3 months") +
  theme(axis.text.x=element_text(angle=60, hjust=1)) + 
  guides(col=FALSE)

Figure 7: Distinct IP per package

As we can see, in 2009 there are two groups of packages: some with a low number of IPs and some with a higher number. As time progresses, the number of distinct IPs increases for some packages. But is the spread in IPs associated with an increase in downloads?

ggplot(stats2, aes(Date, Nb_of_downloads, col = Package)) + 
  geom_line() + 
  theme_bw() +
  ggtitle("Downloads per IP") +
  ylab("Downloads") +
  scale_x_datetime(date_breaks = "3 months") +
  theme(axis.text.x=element_text(angle=60, hjust=1)) + 
  guides(col=FALSE)

Figure 8: Downloads per year

Surprisingly, some packages have a big burst of up to 400k downloads, others of just 100k. But let’s focus on the lower end:

ggplot(stats2, aes(Date, Nb_of_downloads, col = Package)) + 
  geom_line() + 
  theme_bw() +
  ggtitle("Downloads per package every three months") +
  ylab("Downloads") +
  scale_x_datetime(date_breaks = "3 months") +
  ylim(0, 50000)+
  theme(axis.text.x=element_text(angle=60, hjust=1)) + 
  guides(col=FALSE)

Figure 9: Downloads per year

There are many packages close to 0 downloads each month, and most packages have fewer than 10000 downloads per month:

ggplot(stats2, aes(Date, Nb_of_downloads, col = Package)) + 
  geom_line() + 
  theme_bw() +
  ggtitle("Downloads per package every three months") +
  ylab("Downloads") +
  scale_x_datetime(date_breaks = "3 months") +
  ylim(0, 10000)+
  theme(axis.text.x=element_text(angle=60, hjust=1)) + 
  guides(col=FALSE)

Figure 10: Downloads per year

As we can see, in general the month of the year also influences the number of downloads. So, from 2010 onwards, the factors influencing the downloads are the year and the month.
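
A simple way to look at the effect of the month separately from the effect of the year is to remove the long-term trend and average what is left by calendar month. The sketch below does this on synthetic totals (a made-up linear trend with a dip every August); it is an illustration of the idea, not a result from the Bioconductor data:

```r
# Synthetic monthly totals over two years: linear growth plus a dip in August.
dates <- seq(as.Date("2015-01-01"), by = "month", length.out = 24)
step <- seq_along(dates)
downloads <- 1000 + 10 * step
is_august <- format(dates, "%m") == "08"
downloads[is_august] <- downloads[is_august] - 300
# Remove the linear trend, then average the residuals by calendar month.
trend <- lm(downloads ~ step)
seasonal <- tapply(residuals(trend), format(dates, "%m"), mean)
# Calendar month with the lowest downloads relative to the trend.
names(which.min(seasonal))
```

Months whose average residual is clearly below or above zero are the ones that pull the downloads away from the yearly trend.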

Maybe there is a relationship between the downloads and the number of IPs per date:

ggplot(stats2, aes(Date, Nb_of_downloads/Nb_of_distinct_IPs, col = Package)) + 
  geom_line() + 
  theme_bw() +
  ggtitle("IPs") +
  ylab("Ratio") +
  scale_x_datetime(date_breaks = "3 months") +
  theme(axis.text.x=element_text(angle=60, hjust=1)) + 
  guides(col=FALSE)

Figure 11: Ratio downloads per IP per package

We can see that some packages have occasional rises in downloads per IP. But at this scale most packages cannot be distinguished, so we zoom in on the small ratios:

ggplot(stats2, aes(Date, Nb_of_downloads/Nb_of_distinct_IPs, col = Package)) + 
  geom_line() + 
  theme_bw() +
  ggtitle("IPs") +
  ylab("Ratio") +
  scale_x_datetime(date_breaks = "3 months") +
  theme(axis.text.x=element_text(angle=60, hjust=1)) + 
  guides(col=FALSE) +
  ylim(1, 5)

Figure 12: Ratio downloads per IP per package

But most of the packages seem to be more or less constant and around 2.

5 Models

One problem when comparing the evolution of the packages is that they started at different moments and, as seen above, the number of downloads has been increasing with time, as has the number of packages. So we need to normalize the starting dates:

norm <- stats2[, .(Norm = as.numeric(Date)/as.numeric(max(Date)), 
                   Downloads = Nb_of_downloads/max(Nb_of_downloads)), by = Package]
ggplot(norm, aes(Norm, Downloads, col = Package)) + 
  geom_line() + 
  theme_bw() + 
  ggtitle("Downloads per stage of the package") +
  xlab("Date normalized") + 
  guides(col = FALSE)

Figure 13: Normalization of dates and downloads

We can observe a tendency for the number of downloads to decrease after a package is included in Bioconductor and to rise again later.
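
The code above divides each date by the latest date of the package. Since a date-time converted to a number counts seconds since 1970, that ratio stays close to 1 for every package. An alternative, shown here only as a sketch on made-up data, is a min-max rescaling so that every package runs from 0 (first download) to 1 (last download); the helper name `rescale01` is hypothetical:

```r
# Made-up data: an old package and a recent one.
toy <- data.frame(
  Package = rep(c("old", "new"), times = c(4, 3)),
  Date = as.Date(c("2009-01-01", "2011-01-01", "2013-01-01", "2017-01-01",
                   "2016-01-01", "2016-07-01", "2017-01-01")),
  Downloads = c(50, 100, 80, 200, 5, 10, 20),
  stringsAsFactors = FALSE
)
# Rescale a numeric vector to the [0, 1] range.
rescale01 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
# Apply the rescaling within each package.
toy$Norm <- ave(as.numeric(toy$Date), toy$Package, FUN = rescale01)
toy$RelDownloads <- ave(toy$Downloads, toy$Package,
                        FUN = function(x) x / max(x))
toy
```

A package with a single month of downloads has no range to rescale and gets `NaN`, so such packages would need to be dropped or handled separately.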

SessionInfo

sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] lubridate_1.6.0   scales_0.4.1      zoo_1.8-0         dtplyr_0.0.2     
## [5] data.table_1.10.4 ggplot2_2.2.1     BiocStyle_2.4.0  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.10     knitr_1.15.1     magrittr_1.5     munsell_0.4.3   
##  [5] lattice_0.20-35  colorspace_1.3-2 R6_2.2.0         highr_0.6       
##  [9] stringr_1.2.0    plyr_1.8.4       dplyr_0.5.0      tools_3.4.0     
## [13] grid_3.4.0       gtable_0.2.0     DBI_0.6-1        htmltools_0.3.6 
## [17] assertthat_0.2.0 yaml_2.1.14      lazyeval_0.2.0   rprojroot_1.2   
## [21] digest_0.6.12    tibble_1.3.0     bookdown_0.3     evaluate_0.10   
## [25] rmarkdown_1.5    labeling_0.3     stringi_1.1.5    compiler_3.4.0  
## [29] backports_1.0.5